Conversation

@huaxingao
Contributor

What changes were proposed in this pull request?

Add a generic ClassificationSummary trait

Why are the changes needed?

Add a generic ClassificationSummary trait so all the classification models can use it to implement summary.

Currently in classification, we only have summary implemented in LogisticRegression. There are requests to implement summary for LinearSVCModel in https://issues.apache.org/jira/browse/SPARK-20249 and to implement summary for RandomForestClassificationModel in https://issues.apache.org/jira/browse/SPARK-23631. If we add a generic ClassificationSummary trait and put all the common code there, we can easily add summary to LinearSVCModel and RandomForestClassificationModel, and also add summary to all the other classification models.

We can use the same approach to add a generic RegressionSummary trait to regression package and implement summary for all the regression models.
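The reuse this enables can be sketched in plain Python (minimal, hypothetical classes; not Spark's actual API): the common metric code lives in one base class, and each model's summary only has to subclass it.

```python
class ClassificationSummary:
    """Shared metrics computed from (prediction, label) pairs."""

    def __init__(self, predictions):
        # predictions: list of (prediction, label) pairs
        self.predictions = predictions

    def accuracy(self):
        # Fraction of rows where the prediction matches the label.
        correct = sum(1 for p, l in self.predictions if p == l)
        return correct / len(self.predictions)


class LogisticRegressionSummary(ClassificationSummary):
    pass  # model-specific fields (e.g. a probability column) would go here


class RandomForestClassificationSummary(ClassificationSummary):
    pass  # reuses accuracy() and any other shared metric for free


summary = RandomForestClassificationSummary(
    [(1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)])
print(summary.accuracy())  # 0.75
```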

Does this PR introduce any user-facing change?

How was this patch tested?

existing tests

* Field in "predictions" which gives the probability or rawPrediction of each class as a vector.
*/
@Since("3.1.0")
def scoreCol: String
Contributor Author

Can't name it probabilityCol because not all the classifiers have a probabilityCol. If there is no probability, rawPrediction will be used instead to calculate the binary classification metrics.

Member

For backwards-compatibility, keep probabilityCol in the subclass as an alias? or else that's a breaking change?
Would we ever want to expose both a probability and some raw score like logits? That might argue for some mixin for probabilityCol or something but that may get too complex.

@SparkQA

SparkQA commented Jun 3, 2020

Test build #123450 has finished for PR 28710 at commit 91a121b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

This mostly makes

@huaxingao
Contributor Author

I put probabilityCol back in the subclass and this should fix the test failure I had yesterday :)
For the multiclass or binary classification metrics we need, it seems to me that either probability or raw prediction is good enough.
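One reason either column works: ranking-based binary metrics such as areaUnderROC depend only on how the scores order the examples, and the sigmoid is monotone, so probabilities and raw margins give the same ranking. A framework-free illustration with a hand-rolled AUC (not Spark code):

```python
import math

def auc(scores, labels):
    # Probability that a random positive outranks a random negative
    # (ties counted as half), computed by brute force.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

raw = [-2.0, -0.5, 0.3, 1.7]                  # raw margins
prob = [1 / (1 + math.exp(-s)) for s in raw]  # sigmoid preserves the order
labels = [0, 0, 1, 1]

# Identical ranking, therefore identical AUC.
assert auc(raw, labels) == auc(prob, labels)
```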

@huaxingao
Contributor Author

also cc @zhengruifeng

@SparkQA

SparkQA commented Jun 4, 2020

Test build #123505 has finished for PR 28710 at commit a4b2891.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 4, 2020

Test build #123509 has finished for PR 28710 at commit 54124a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 4, 2020

Test build #123543 has finished for PR 28710 at commit f42fa8e.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _ClassificationSummary(JavaWrapper):
  • class _TrainingSummary(JavaWrapper):
  • class _BinaryClassificationSummary(_ClassificationSummary):
  • class LogisticRegressionSummary(_ClassificationSummary):
  • class LogisticRegressionTrainingSummary(LogisticRegressionSummary, _TrainingSummary):
  • class BinaryLogisticRegressionSummary(_BinaryClassificationSummary,

@SparkQA

SparkQA commented Jun 5, 2020

Test build #123545 has finished for PR 28710 at commit 7b01b63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

//[SPARK-31840] Add instance weight support in LogisticRegressionSummary
// weightCol in org.apache.spark.ml.classification.LogisticRegressionSummary is present only in current version
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionSummary.weightCol")
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionSummary.weightCol"),
Member

I'm thinking through what, if anything, breaks here in user code. Are all the subclasses sealed traits? Could any user code have subclassed the old traits in a way that would then not compile now? I don't think so, but I'm just going off reading the diff.

Member

Same general question on the Python side; I'm not as sure about the possibility of breaking something there

Contributor Author

Before change, the code structure is like this:

sealed trait LogisticRegressionSummary extends Serializable
sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary
sealed trait BinaryLogisticRegressionSummary extends LogisticRegressionSummary
sealed trait BinaryLogisticRegressionTrainingSummary extends 
  BinaryLogisticRegressionSummary with LogisticRegressionTrainingSummary

So it seems OK to me; no user code can subclass the old traits.

Python seems OK to me too. I only added a new method. Applications written for old versions of Spark shouldn't be affected by this.

ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary.org$apache$spark$ml$classification$ClassificationSummary$_setter_$org$apache$spark$ml$classification$ClassificationSummary$$multiclassMetrics_="),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary.org$apache$spark$ml$classification$ClassificationSummary$$multiclassMetrics"),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary.weightCol"),
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.classification.LogisticRegressionSummary.asBinary"),
Member

Though they are sealed traits, could a change of return type like this possibly break user code? Users would get a BinaryClassificationSummary instead of a BinaryLogisticRegressionSummary, for example.

Contributor Author

@viirya Thanks for your comment. I think we are still OK, because summary is still an instance of BinaryLogisticRegressionSummary

val blorModel = lr.fit(smallBinaryDataset)
assert(blorModel.summary.isInstanceOf[BinaryLogisticRegressionTrainingSummary])
assert(blorModel.summary.asBinary.isInstanceOf[BinaryLogisticRegressionSummary])
assert(blorModel.binarySummary.isInstanceOf[BinaryLogisticRegressionTrainingSummary])

Similarly, in multinomial case, summary is still an instance of LogisticRegressionTrainingSummary

val mlorModel = lr.setFamily("multinomial").fit(smallMultinomialDataset)
assert(mlorModel.summary.isInstanceOf[LogisticRegressionTrainingSummary])

Contributor

in multinomial case, summary is still an instance of LogisticRegressionTrainingSummary

Is it expected?
I think a reasonable behavior should be:
for any classification model, a ClassificationSummary should be attached;
if and only if numClasses=2, then this summary is also a BinaryClassificationSummary
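The proposed behavior can be sketched in plain Python (hypothetical names, not the actual Spark classes): every model gets a ClassificationSummary, and when numClasses is 2 it is additionally a BinaryClassificationSummary.

```python
class ClassificationSummary:
    """Attached to every classification model."""

class BinaryClassificationSummary(ClassificationSummary):
    """Additionally attached if and only if numClasses == 2."""

def make_summary(num_classes):
    # The binary summary is only attached for two-class problems;
    # either way the result is a ClassificationSummary.
    if num_classes == 2:
        return BinaryClassificationSummary()
    return ClassificationSummary()

assert isinstance(make_summary(2), BinaryClassificationSummary)
multi = make_summary(3)
assert isinstance(multi, ClassificationSummary)
assert not isinstance(multi, BinaryClassificationSummary)
```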

import sparkSession.implicits._

// TODO: Allow the user to vary the number of bins using a setBins method in
// BinaryClassificationMetrics. For now the default is set to 100.
Contributor

'For now the default is set to 100.' is wrong; the default numBins in BinaryClassificationMetrics is 1000.

Contributor Author

This piece of code was used in LogisticRegression and I moved it here. I'm not really sure what the comment means, but it might mean "Allow the user to vary the number of bins using a setBins method in BinaryClassificationMetrics. For now, set it to 100."

Contributor

OK, I suggest changing it to the default value of 1000, so that the metrics are the same as those from the evaluator and from BinaryClassificationMetrics with the default numBins.

predictions.select(col(scoreCol), col(labelCol).cast(DoubleType),
lit(1.0)).rdd.map {
case Row(score: Vector, label: Double, weight: Double) => (score(1), label, weight)
}, 100
Contributor

ditto

* Field in "predictions" which gives the probability or rawPrediction of each class as a vector.
*/
@Since("3.1.0")
def scoreCol: String
Contributor

Since probabilityCol can be added in subclasses, can scoreCol be removed?

Contributor Author

We still need to keep scoreCol in this generic trait because it is needed in BinaryClassificationSummary:

    new BinaryClassificationMetrics(
      predictions.select(col(scoreCol), col(labelCol).cast(DoubleType),

Contributor

Is it possible to add another trait, ProbabilisticClassifierSummary, to avoid this scoreCol? It does not seem to exist in the current impl.

Contributor

Why is there no rawPredictionCol?

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123789 has finished for PR 28710 at commit c2fb2b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor

MulticlassMetrics needs "prediction, label, weight(optional), probability (optional, only for logLoss)", probability can be ignored for now;

BinaryClassificationMetrics needs "score, label, weight", here score is the probability of class 1 in BinaryLogisticRegressionSummary;

So I guess we can add a BinaryClassificationSummary to support LinearSVC; then scoreCol seems no longer needed, and several breaking changes may be avoided.

@huaxingao
Contributor Author

@zhengruifeng Thanks for your comments.
I think the traits should be like this (not including weightCol etc. for simplicity):

trait ClassificationSummary {
  def predictionCol: String
  def labelCol: String
  val multiclassMetrics =
    new MulticlassMetrics(
      predictions.select(col(predictionCol), col(labelCol).cast(DoubleType))
      . . . . . .
    )
}

trait BinaryClassificationSummary extends ClassificationSummary {
  def rawPredictionCol: String
  val binaryMetrics =
    new BinaryClassificationMetrics(
      predictions.select(col(rawPredictionCol), col(labelCol).cast(DoubleType))
      . . . . . .
    )
}

However, currently BinaryLogisticRegressionSummary uses probabilityCol for BinaryClassificationMetrics, and this probabilityCol is in LogisticRegressionSummary instead of BinaryLogisticRegressionSummary. In order not to break the existing code, I need to make several changes to the above traits:

  1. Change rawPredictionCol to scoreCol:
     can't use rawPredictionCol since currently LogisticRegression uses probabilityCol;
     can't use probabilityCol since LinearSVC doesn't have probabilityCol.
  2. Put scoreCol in ClassificationSummary (since currently probabilityCol is in LogisticRegressionSummary instead of BinaryLogisticRegressionSummary).

That's how I get the current traits, as follows:
trait ClassificationSummary {
  def scoreCol: String
  def predictionCol: String
  def labelCol: String
  val multiclassMetrics =
    new MulticlassMetrics(
      predictions.select(col(predictionCol), col(labelCol).cast(DoubleType))
      . . . . . .
    )
}

trait BinaryClassificationSummary extends ClassificationSummary {
  val binaryMetrics =
    new BinaryClassificationMetrics(
      predictions.select(col(scoreCol), col(labelCol).cast(DoubleType))
      . . . . . .
    )
}

To implement summary for other classifiers:

LinearSVCSummary extends BinaryClassificationSummary  // use rawPredictionCol
FMClassifierSummary extends BinaryClassificationSummary  // use probabilityCol

For RandomForestClassifier (also for DecisionTreeClassifier and GBTClassifier):

RandomForestSummary extends ClassificationSummary
BinaryRandomForestSummary extends BinaryClassificationSummary  // use probabilityCol
if (numClasses == 2)
   summary = BinaryRandomForestSummary
else
   summary = RandomForestSummary
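The plan above can be sketched in plain Python (illustrative class names; these are not the actual Spark classes): each binary summary supplies scoreCol from whatever column its model actually produces.

```python
class BinaryClassificationSummary:
    """Binary metrics are computed from the column named by score_col()."""

    def score_col(self):
        raise NotImplementedError

class LinearSVCSummary(BinaryClassificationSummary):
    # LinearSVC produces no probability column, so it uses rawPrediction.
    def score_col(self):
        return "rawPrediction"

class FMClassifierSummary(BinaryClassificationSummary):
    # FMClassifier does produce probabilities, so it uses probability.
    def score_col(self):
        return "probability"

assert LinearSVCSummary().score_col() == "rawPrediction"
assert FMClassifierSummary().score_col() == "probability"
```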

@zhengruifeng
Contributor

@huaxingao
Since scoreCol is only used to compute BinaryClassificationMetrics, what about putting it in BinaryClassificationSummary:

trait BinaryClassificationSummary extends ClassificationSummary {
 def scoreCol: String = null
}

Then in LogisticRegression:

LogisticRegressionSummary extends ClassificationSummary {
   def probabilityCol = "..."
}

trait BinaryLogisticRegressionSummary extends LogisticRegressionSummary with BinaryClassificationSummary {
    def scoreCol = if (probabilityCol.nonEmpty) probabilityCol else if (rawPredictionCol.nonEmpty) rawPredictionCol else throw ...
}

@SparkQA

SparkQA commented Jun 13, 2020

Test build #123980 has finished for PR 28710 at commit b8aeff6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124021 has finished for PR 28710 at commit 2e6f35c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124030 has finished for PR 28710 at commit 2e6f35c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

retest this please

/**
* Abstraction for binary classification results for a given model.
*/
trait BinaryClassificationSummary extends ClassificationSummary {
Contributor

private[classification]?

@Since("1.5.0")
@transient lazy val recallByThreshold: DataFrame = {
binaryMetrics.recallByThreshold().toDF("threshold", "recall")
override def scoreCol: String = if (probabilityCol.nonEmpty) {
Contributor

private?

Contributor

If probabilityCol.isEmpty, use rawPredictionCol instead?

Contributor

Since the existing binary computation did not use rawPredictionCol, I think we can leave it alone for now.
In the future, I think we can make it a little heuristic: if probabilityCol.nonEmpty, use probabilityCol; then try rawPredictionCol.

ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.BinaryLogisticRegressionSummary.org$apache$spark$ml$classification$ClassificationSummary$$multiclassMetrics"),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.BinaryLogisticRegressionSummary.weightCol")
)

Contributor

Are there some MiMa exclusions that can be removed after the latest change?

Contributor Author

MiMa exclusions have already been updated

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124045 has finished for PR 28710 at commit 0898708.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor

retest this please

@SparkQA

SparkQA commented Jun 16, 2020

Test build #124110 has finished for PR 28710 at commit 61e93d0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 16, 2020

Test build #124128 has finished for PR 28710 at commit 61e93d0.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 16, 2020

Test build #124129 has finished for PR 28710 at commit 61e93d0.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 16, 2020

Test build #124131 has finished for PR 28710 at commit 61e93d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@zhengruifeng zhengruifeng left a comment

LGTM. Since it is a big change with MiMa exclusions,
@viirya @srowen could you please help review the latest commits?

@huaxingao
Contributor Author

I moved asBinary back to LogisticRegressionSummary to get rid of this

ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.classification.LogisticRegressionSummary.asBinary")

All the rest of the MiMa problems are InheritedNewAbstractMethodProblem. I think those are OK.

@huaxingao
Contributor Author

retest this please

/** Number of training iterations. */
@Since("3.1.0")
def totalIterations: Int = {
assert(objectiveHistory.length > 0, s"objectiveHistory length should be greater than 1.")
Member

The assert condition checks length > 0, so the message should say "objectiveHistory length should be greater than 0"?

And the string interpolation s"" is unnecessary.

override def scoreCol: String = if (probabilityCol.nonEmpty) {
probabilityCol
} else {
throw new SparkException(s"probabilityCol is required for BinaryLogisticRegressionSummary.")
Member

s"" is not needed.


@property
@since("2.0.0")
@since("3.1.0")
Member

Why change this @since version? probabilityCol above, for example, is not changed.

@SparkQA

SparkQA commented Jun 18, 2020

Test build #124197 has finished for PR 28710 at commit e844c11.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2020

Test build #124198 has finished for PR 28710 at commit e844c11.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2020

Test build #124206 has finished for PR 28710 at commit c76d591.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines -1223 to -1225
.. note:: This ignores instance weights (setting all to 1.0) from
`LogisticRegression.weightCol`. This will change in later Spark
versions.
Member

I saw you removed this note in the latest commit. So now these methods use weightCol?

Contributor Author

Yes. I added weightCol in another PR, but forgot to update the Python comment.

//[SPARK-31893] Add a generic ClassificationSummary trait
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionTrainingSummary.org$apache$spark$ml$classification$ClassificationSummary$_setter_$org$apache$spark$ml$classification$ClassificationSummary$$multiclassMetrics_="),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionTrainingSummary.org$apache$spark$ml$classification$ClassificationSummary$$multiclassMetrics"),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionTrainingSummary.weightCol"),
Member

@viirya viirya Jun 18, 2020

Not sure if it is for all cases. But this looks like just because weightCol is inherited from ClassificationSummary instead of LogisticRegressionSummary now.

Contributor Author

I added weightCol to LogisticRegressionSummary in another PR (I just realized the MiMa exclude for that PR is not needed any more, since I moved weightCol to ClassificationSummary).

3.0: no weightCol

After my changes:
ClassificationSummary has weightCol; LogisticRegressionSummary, LogisticRegressionTrainingSummary, BinaryLogisticRegressionSummary, and BinaryLogisticRegressionTrainingSummary all inherit weightCol, so there are four InheritedNewAbstractMethodProblem entries for weightCol.

@SparkQA

SparkQA commented Jun 18, 2020

Test build #124227 has finished for PR 28710 at commit ead6da2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya viirya left a comment

Looks okay. The InheritedNewAbstractMethodProblem MiMa problems seem to be because some methods moved to the traits created in this change.

@huaxingao
Contributor Author

@srowen Any more comments?

@srowen
Member

srowen commented Jun 19, 2020

Just catching up here: so all the MiMa excludes don't actually represent public API changes, right? They're all sealed or private?
We're going to lose the @Since annotations, which is a little unfortunate, but I can see the value in refactoring.

@huaxingao
Contributor Author

Right. They are all sealed or private. No public API changes. @srowen

@srowen
Member

srowen commented Jun 20, 2020

Merged to master

@srowen srowen closed this in 297016e Jun 20, 2020
@huaxingao
Contributor Author

Thank you all!
